{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Naive Bayes with differential privacy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We start by importing the required libraries and modules and collecting the data that we need from the [Adult dataset](https://archive.ics.uci.edu/ml/datasets/adult)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import diffprivlib.models as dp\n",
    "import numpy as np\n",
    "from sklearn.naive_bayes import GaussianNB"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train = np.loadtxt(\"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data\",\n",
    "                        usecols=(0, 4, 10, 11, 12), delimiter=\", \")\n",
    "\n",
    "y_train = np.loadtxt(\"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data\",\n",
    "                        usecols=14, dtype=str, delimiter=\", \")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's also collect the test data from Adult to test our models once they're trained."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_test = np.loadtxt(\"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test\",\n",
    "                        usecols=(0, 4, 10, 11, 12), delimiter=\", \", skiprows=1)\n",
    "\n",
    "y_test = np.loadtxt(\"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test\",\n",
    "                        usecols=14, dtype=str, delimiter=\", \", skiprows=1)\n",
    "# Must trim trailing period \".\" from label\n",
    "y_test = np.array([a[:-1] for a in y_test])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Naive Bayes with no privacy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To begin, let's first train a regular (non-private) naive Bayes classifier, and test its accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GaussianNB()"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nonprivate_clf = GaussianNB()\n",
    "nonprivate_clf.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Non-private test accuracy: 79.64%\n"
     ]
    }
   ],
   "source": [
    "print(\"Non-private test accuracy: %.2f%%\" % \n",
    "     (nonprivate_clf.score(X_test, y_test) * 100))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Differentially private naive Bayes classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the `models.GaussianNB` module of diffprivlib, we can train a naive Bayes classifier while satisfying differential privacy.\n",
    "\n",
    "If we don't specify any parameters, the model defaults to `epsilon = 1` and selects the model's feature bounds from the data. This throws a warning with `.fit()` is first called, as it leaks additional privacy. To ensure no additional privacy loss, we should specify the bounds as an argument, and choose the bounds indepedently of the data (i.e. using domain knowledge)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "dp_clf = dp.GaussianNB()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you re-evaluate this cell, the test accuracy will change. This is due to the randomness introduced by differential privacy. Nevertheless, the accuracy should be in the range of 87–93%."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Differentially private test accuracy (epsilon=1.00): 79.33%\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      ".../site-packages/diffprivlib/models/naive_bayes.py:100: PrivacyLeakWarning: Bounds have not been specified and will be calculated on the data provided. This will result in additional privacy leakage. To ensure differential privacy and no additional privacy leakage, specify bounds for each dimension.\n",
      "  \"privacy leakage, specify bounds for each dimension.\", PrivacyLeakWarning)\n"
     ]
    }
   ],
   "source": [
    "dp_clf.fit(X_train, y_train)\n",
    "\n",
    "print(\"Differentially private test accuracy (epsilon=%.2f): %.2f%%\" % \n",
    "      (dp_clf.epsilon, dp_clf.score(X_test, y_test) * 100))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By setting `epsilon=float(\"inf\")` we get an identical model to the non-private naive Bayes classifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Agreement between non-private and differentially private (epsilon=inf) classifiers: 100.00%\n"
     ]
    }
   ],
   "source": [
    "dp_clf = dp.GaussianNB(epsilon=float(\"inf\"), bounds=(-1e5, 1e5))\n",
    "dp_clf.fit(X_train, y_train)\n",
    "\n",
    "print(\"Agreement between non-private and differentially private (epsilon=inf) classifiers: %.2f%%\" % \n",
    "     (dp_clf.score(X_test, nonprivate_clf.predict(X_test)) * 100))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Changing `epsilon`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "On this occasion, we're going to specify the `bounds` parameter as a list of tuples, indicating the ranges in which we expect each feature to lie."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "bounds = ([17, 1, 0, 0, 1], [100, 16, 100000, 4500, 100])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will also specify a value for `epsilon`. High `epsilon` (i.e. greater than 1) gives better and more consistent accuracy, but less privacy. Small `epsilon` (i.e. less than 1) gives better privacy but worse and less consistent accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GaussianNB(accountant=BudgetAccountant(spent_budget=[(1.0, 0), (inf, 0), (0.1, 0)]),\n",
       "           bounds=(array([17.,  1.,  0.,  0.,  1.]),\n",
       "                   array([1.0e+02, 1.6e+01, 1.0e+05, 4.5e+03, 1.0e+02])),\n",
       "           epsilon=0.1)"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dp_clf2 = dp.GaussianNB(epsilon=0.1, bounds=bounds)\n",
    "\n",
    "dp_clf2.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Differentially private test accuracy (epsilon=0.10): 79.34%\n"
     ]
    }
   ],
   "source": [
    "print(\"Differentially private test accuracy (epsilon=%.2f): %.2f%%\" % \n",
    "     (dp_clf2.epsilon, dp_clf2.score(X_test, y_test) * 100))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
